Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation

Authors

Abstract

Relating linguistic information observed across visual and audio data is a key challenge in audio-visual speech recognition (AVSR), benefiting not only audio-only and visual-only recognition (ASR/VSR) but also manipulation within and across modalities. In this paper, we present a feature-disentanglement-based framework that jointly addresses the above tasks. By advancing cross-modal mutual learning strategies, our model is able to convert visual- or audio-based features into modality-agnostic representations. These derived representations allow one to perform ASR, VSR, and AVSR, as well as to manipulate the output based on the desired subject identity and content information. Extensive experiments on different recognition and synthesis tasks show that our method performs favorably against state-of-the-art approaches on each individual task, while offering a unified solution to tackle the aforementioned problems.
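The core idea of the abstract, learning modality-agnostic representations by pulling paired audio and visual features into a shared space, can be illustrated with a minimal sketch. Everything below (encoder shapes, the cosine-based alignment loss, all variable names) is a hypothetical illustration of the general cross-modal alignment idea, not the paper's actual architecture.

```python
# Minimal sketch of cross-modal feature alignment: two per-modality encoders
# project audio and visual features into one shared space, where an alignment
# loss encourages paired clips to coincide. Illustrative only.
import math
import random

random.seed(0)
DIM_IN, DIM_SHARED = 8, 4

def linear(x, w):
    # Project input vector x (len DIM_IN) through weight matrix w
    # (DIM_SHARED rows of DIM_IN), producing a shared-space embedding.
    return [sum(xi * wij for xi, wij in zip(x, row)) for row in w]

def cosine(a, b):
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Hypothetical per-modality encoders (randomly initialized linear maps).
w_audio = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(DIM_SHARED)]
w_visual = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(DIM_SHARED)]

# A paired audio/visual feature sample (random stand-ins for real features).
audio_feat = [random.gauss(0, 1) for _ in range(DIM_IN)]
visual_feat = [random.gauss(0, 1) for _ in range(DIM_IN)]

z_a = linear(audio_feat, w_audio)
z_v = linear(visual_feat, w_visual)

# Alignment loss in the spirit of mutual learning: 1 - cosine similarity.
# Minimizing it over paired data pushes the encoders toward a shared,
# modality-agnostic representation.
loss = 1.0 - cosine(z_a, z_v)
print(round(loss, 4))
```

In a full system this loss would be minimized jointly with recognition objectives, so the shared embedding supports ASR, VSR, and AVSR from either input modality.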


Similar Articles

Continuous Audio-visual Speech Recognition

We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audiovisual speech recognition applications. An appearance based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal model...


Cross-modal Visual-audio Priming

This study assessed whether presenting visual-only stimuli prior to auditory stimuli facilitates the recognition of spoken words in noise. The results of the study indicate that this type of cross-modal priming does occur. Future directions for research in this domain are presented.


Audio-Visual Speech Recognition

We have made significant progress in automatic speech recognition (ASR) for well-defined applications like dictation and medium vocabulary transaction processing tasks in relatively controlled environments. However, for ASR to approach human levels of performance and for speech to become a truly pervasive user interface, we need novel, nontraditional approaches that have the potential of yielding...


CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

Visual and audio modalities are two symbiotic modalities underlying videos, which contain both common and complementary information. If they can be mined and fused sufficiently, the performance of related video tasks can be significantly enhanced. However, due to environmental interference or sensor faults, sometimes only one modality exists while the other is abandoned or missing. By recoveri...




Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2022

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v36i3.20210